Abstract. This paper describes our participation in the TREC 2009 Blog Track.
Our system consists of a query likelihood component and a news headline
prior component, both based on the language model framework. For the query
likelihood, we propose several approaches to estimating the query language model
and the news headline language model. We also suggest two approaches to choosing
the 10 supporting relevant posts: Feed-Based Selection and Cluster-Based Selection.
Furthermore, we propose two criteria for estimating the news headline prior for a
given day. Experimental results show that using the prior significantly improves
performance on the task.


1  Introduction 

The Blog track explores information seeking behavior in the blogosphere. Compared with
previous Blog tracks, the Blog track 2009 aims to investigate more refined and complex
search scenarios in the blogosphere. In TREC 2009, the Blog track has two main tasks:
the Faceted Blog Distillation Task and the Top Stories Identification Task.

Of these two tasks, we participated in the Top Stories Identification Task. The Top
Stories Identification Task is a new pilot search task addressing the news dimension in
the blogosphere. The query in this task is a date. For a given date query, a system
should rank news headlines according to their importance on the given day. Furthermore,
for each headline, the system should provide 10 supporting blog posts which
are relevant to and discuss the news story headline. To achieve this, the system is
provided with a news headline corpus and the Blogs08 corpus, and may use external
resources.

For the Top Stories Identification Task, our approach consists of two steps: the
preprocessing step and the news headline ranking step. In the preprocessing step, HTML
tags and non-relevant content supplied by blog providers, such as site descriptions and
menus, are removed. In the news headline ranking step, we build two language models,
the date query language model and the headline language model, and estimate the
probability that each news headline will be the top story.

2  Preprocessing  Step 

The TREC Blogs08 collection contains permalinks, feed files, and blog homepages. We
used only the permalink pages for the Top Stories Identification Task. The permalinks are
encoded in HTML, and they come in many different styles. Besides the relevant
textual parts, the permalinks contain much non-topical or non-relevant content, such as
HTML tags, advertisements, site descriptions, and menus.

The non-relevant content comes from many different types of blog templates, which
may be provided by commercial blog service vendors. We used the DiffPost algorithm
[5, 7] to deal with the non-relevant content.

To preprocess the Blogs08 corpus, we first discarded all HTML tags and then applied
the DiffPost algorithm to remove non-relevant content. DiffPost segments each document
into lines using the carriage return as a separator. It then compares the sets of lines
of different documents and regards their intersection as non-content information.

For example, let $P_i$ and $P_j$ be blog posts within the same blog feed. Let $S_i$ and
$S_j$ be the sets of lines corresponding to $P_i$ and $P_j$, respectively.

$$\mathrm{NoisyInformation}(P_i, P_j) = S_i \cap S_j \tag{1}$$

We discarded the non-relevant content by taking the set difference between a document
and its noisy information. Finally, we removed stopwords from the content produced by
the DiffPost algorithm.
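The line-set comparison described above can be illustrated with a minimal sketch. It is not the DiffPost implementation of [5, 7]: the function names, the whitespace trimming, and the choice to compare a post against every sibling post from the same feed are assumptions made for illustration.

```python
def split_lines(text):
    """Segment a document into non-empty, whitespace-trimmed lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def noisy_information(post_i, post_j):
    """Eq. 1: lines shared by two posts of the same feed are treated as noise."""
    return set(split_lines(post_i)) & set(split_lines(post_j))


def diffpost(post, sibling_posts):
    """Keep only the lines of `post` that it shares with no sibling post."""
    noise = set()
    for other in sibling_posts:
        noise |= noisy_information(post, other)
    # Set difference between the document and its noisy information.
    return [line for line in split_lines(post) if line not in noise]


# Two hypothetical posts from the same feed, with tags already stripped.
post_a = "My Blog\nHome | About\nObama wins the election\nCopyright 2008"
post_b = "My Blog\nHome | About\nOil prices fall again\nCopyright 2008"
content = diffpost(post_a, [post_b])
print(content)
```

In this toy case the template lines (title, menu, copyright notice) occur in both posts and are dropped, leaving only the line specific to the first post.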


3  News  Headline  Ranking  Step 

The Top Stories Identification Task aims to find important news headlines for a given
date query. Let H be a news headline and Q be a date query. To estimate the importance
of a news headline for a date query, we used the language model framework, which is
widely used in information retrieval tasks.

$$P(H \mid Q) \propto \underbrace{P(Q \mid H)}_{\text{Query Likelihood}} \; \underbrace{P(H)}_{\text{Headline Prior}} \tag{2}$$

That is, we assume that for a given date query the importance of news headlines can
be estimated using the probability that a query language model generates each news
headline. We evaluate each component using only the Blogs08 corpus and the news headline
corpus, without resorting to any external resources.

3.1  The  Query  Likelihood 

For the query likelihood, we need to estimate two language models, the Query Language
Model (QLM) and the News Headline Language Model (NHLM). Both language models
are estimated using the contents of blog posts.


The Query Language Model. We gather the blog posts dated between 3 days before and
7 days after a given date query. We believe that if a news headline is important on a
given day, many blog posts relevant to the news topic will have been posted around that day.


The gathered documents contain not only information relevant to the news topic
but also background information and non-relevant topics. We assume that the documents
are generated by a mixture of the QLM and the collection language model. Let
$QD = \{d_1, d_2, \ldots, d_n\}$ be the documents between -3 and +7 days for a given query.

$$P(QD) = \prod_{i} \prod_{w} \big((1-\lambda) P(w \mid \theta_{QLM}) + \lambda P(w \mid \theta_C)\big)^{c(w; d_i)} \tag{3}$$

where $\theta_{QLM}$ is the query language model, $P(w \mid \theta_C) = ctf_w / |C|$, $ctf_w$ is
the number of times term $w$ occurs in the entire collection, $|C|$ is the length of the
collection, $c(w; d_i)$ is the count of word $w$ in document $d_i$, and $\lambda$ is a
weighting parameter (footnote 1).

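Then, we can estimate $\theta_{QLM}$ using the EM algorithm [1]. The EM updates for
$P(w \mid \theta_{QLM})$ are as follows:

$$t^{(n)}(w) = \frac{(1-\lambda) P^{(n)}(w \mid \theta_{QLM})}{(1-\lambda) P^{(n)}(w \mid \theta_{QLM}) + \lambda P(w \mid \theta_C)} \tag{4}$$

$$P^{(n+1)}(w \mid \theta_{QLM}) = \frac{\sum_{j=1}^{n} c(w; d_j)\, t^{(n)}(w)}{\sum_{w'} \sum_{j=1}^{n} c(w'; d_j)\, t^{(n)}(w')} \tag{5}$$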
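As an illustration of the EM updates in Eqs. 4 and 5, the following is a minimal sketch rather than the code used in our runs. The function name, the initialisation with the maximum likelihood estimate, the fixed number of iterations, and the toy data are assumptions; only the mixture weight of 0.7 comes from footnote 1.

```python
from collections import Counter


def estimate_qlm(docs, collection_model, lam=0.7, iterations=30):
    """EM for the two-component mixture of Eq. 3.

    docs: list of token lists (the set QD).
    collection_model: dict word -> P(w | theta_C); must cover every word in docs.
    lam: weight of the collection model (lambda).
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    # Assumed initialisation: maximum likelihood estimate over QD.
    qlm = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step (Eq. 4): probability that an occurrence of w came from the QLM.
        t = {}
        for w in counts:
            topic = (1 - lam) * qlm[w]
            t[w] = topic / (topic + lam * collection_model[w])
        # M-step (Eq. 5): renormalise the expected counts.
        norm = sum(counts[w] * t[w] for w in counts)
        qlm = {w: counts[w] * t[w] / norm for w in counts}
    return qlm


# Toy data: "the" is frequent in the collection, the other words are rare.
docs = [["the", "election", "the", "vote"], ["the", "election", "result"]]
collection = {"the": 0.5, "election": 0.01, "vote": 0.01, "result": 0.01}
qlm = estimate_qlm(docs, collection)
print(sorted(qlm.items(), key=lambda item: item[1], reverse=True))
```

In this toy case the mass of the background word "the" is shifted to the collection component, so the estimated query model concentrates on the topical words.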


The News Headline Language Model. To learn the NHLM for each news headline,
we first retrieved blog posts relevant to its topic. We evaluate the relevance between a
news headline H and a blog post d using the KL-divergence language model [4] with
Dirichlet smoothing [8]:

$$Score(H, d) = -\sum_{w} P(w \mid H) \log \frac{P(w \mid H)}{P(w \mid d)} \tag{6}$$

where $P(w \mid H)$ is the maximum likelihood estimate of the news headline, and
$P(w \mid d) = \frac{c(w; d) + \mu_d P(w \mid \theta_C)}{|d| + \mu_d}$; $\mu_d$ is a smoothing parameter (footnote 2).
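The scoring in Eq. 6 can be sketched as follows. This is an illustration under assumptions, not our retrieval code: the function names and the toy collection probabilities are hypothetical, and only the value $\mu_d = 1000$ is taken from footnote 2.

```python
import math
from collections import Counter


def dirichlet_prob(word, doc_counts, doc_len, collection_model, mu=1000.0):
    """P(w | d) with Dirichlet smoothing, as defined below Eq. 6."""
    return (doc_counts.get(word, 0) + mu * collection_model[word]) / (doc_len + mu)


def kl_score(headline_tokens, doc_tokens, collection_model, mu=1000.0):
    """Eq. 6: negative KL divergence between the headline and post models."""
    h_counts = Counter(headline_tokens)
    d_counts = Counter(doc_tokens)
    score = 0.0
    for w, c in h_counts.items():
        p_h = c / len(headline_tokens)  # maximum likelihood estimate P(w | H)
        p_d = dirichlet_prob(w, d_counts, len(doc_tokens), collection_model, mu)
        score -= p_h * math.log(p_h / p_d)
    return score


# Hypothetical collection probabilities for the headline words.
collection = {"obama": 0.001, "wins": 0.002, "election": 0.001}
headline = ["obama", "wins", "election"]
on_topic = ["obama", "election", "results", "obama"]
off_topic = ["oil", "prices", "fall"]
print(kl_score(headline, on_topic, collection), kl_score(headline, off_topic, collection))
```

Only words that occur in the headline contribute to the sum, so the collection model needs to cover the headline vocabulary only.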

From the search results, we keep only the blog posts dated between 7 days before and
28 days after the given date query. We then choose 10 supporting relevant posts. When
selecting the 10 supporting posts, we want them to capture diverse aspects or opinions
relevant to the news story. To achieve this, we propose two different approaches.

One is Feed-Based Selection (FBS), which selects supporting relevant posts based
on the feeds they belong to. This approach selects the 10 supporting relevant
posts from as many different blog feeds as possible. To this end, we choose at most two
blog posts from each blog feed according to their relevance scores from Eq. 6.
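A minimal sketch of this selection rule is given below. The tuple layout, the function name, and the greedy pass over the ranked list are assumptions for illustration; the text above fixes only the limits of 10 posts in total and two posts per feed.

```python
from collections import defaultdict


def feed_based_selection(ranked_posts, k=10, max_per_feed=2):
    """Pick up to k posts in score order, taking at most max_per_feed per feed.

    ranked_posts: list of (post_id, feed_id, score) tuples, score as in Eq. 6.
    """
    per_feed = defaultdict(int)
    selected = []
    by_score = sorted(ranked_posts, key=lambda post: post[2], reverse=True)
    for post_id, feed_id, _score in by_score:
        if per_feed[feed_id] < max_per_feed:
            selected.append(post_id)
            per_feed[feed_id] += 1
        if len(selected) == k:
            break
    return selected


# Hypothetical search results: three posts from feed A, one from feed B.
posts = [("p1", "A", -1.0), ("p2", "A", -1.1), ("p3", "A", -1.2), ("p4", "B", -1.3)]
print(feed_based_selection(posts, k=3))
```

With k = 3, the third post of feed A is skipped in favour of the lower-scored post from feed B.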

The other is Cluster-Based Selection (CBS), which first groups the search results
into 10 clusters and then chooses one representative document from each cluster. To
cluster the search results, we use the K-Medoids clustering algorithm [3], with
J-Divergence [6] as the distance function.


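$$J(d_i \| d_j) = \sum_{w} P(w \mid \theta_i) \log \frac{P(w \mid \theta_i)}{P(w \mid \theta_j)} + \sum_{w} P(w \mid \theta_j) \log \frac{P(w \mid \theta_j)}{P(w \mid \theta_i)} \tag{7}$$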
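The clustering step can be sketched as follows, reading $\theta_i$ and $\theta_j$ in Eq. 7 as the smoothed language models of posts $d_i$ and $d_j$. The code is an assumption-laden illustration: it uses a simple alternating variant of K-Medoids with random initialisation, which is not necessarily the procedure of [3], and all names and toy models are hypothetical.

```python
import math
import random


def j_divergence(p, q):
    """Eq. 7: sum of the two KL divergences between models p and q.

    p and q are dicts over the same vocabulary with strictly positive values.
    """
    return sum(
        p[w] * math.log(p[w] / q[w]) + q[w] * math.log(q[w] / p[w]) for w in p
    )


def k_medoids(models, k, iterations=20, seed=0):
    """Alternate between assigning posts to medoids and re-picking medoids."""
    n = len(models)
    dist = [[j_divergence(models[i], models[j]) for j in range(n)] for i in range(n)]
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(iterations):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # The new medoid minimises the total distance to its cluster members.
        # Empty clusters (possible with duplicate models) are dropped.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][o] for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids


# Four toy post models forming two obvious groups.
a1 = {"x": 0.70, "y": 0.20, "z": 0.10}
a2 = {"x": 0.65, "y": 0.25, "z": 0.10}
b1 = {"x": 0.10, "y": 0.20, "z": 0.70}
b2 = {"x": 0.10, "y": 0.25, "z": 0.65}
representatives = k_medoids([a1, a2, b1, b2], k=2)
print(representatives)
```

The returned indices are the medoids, which serve as the one representative document per cluster.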


1 In our runs, we set $\lambda = 0.7$

2 In our runs, we set $\mu_d = 1000$


We estimate the NHLM using the maximum likelihood estimate of the 10 supporting
relevant posts and Dirichlet smoothing [8].

Let $\theta_{NHLM}$ and $\theta_C$ be the NHLM and the collection language model, respectively.
Let SD be the set of the 10 supporting relevant posts.

$$P(w \mid \theta_{NHLM}) = \frac{c(w; SD) + \mu P(w \mid \theta_C)}{|SD| + \mu} \tag{8}$$

where $c(w; SD)$ is the count of word $w$ in the document set SD and $\mu$ is a smoothing
parameter (footnote 3).


The Score Function. To evaluate the query likelihood, we use the KL-divergence language
model [4] to rank news headlines in response to a given date query.

Let $Score_{QLH}(Q, H)$ be the relevance score of the news headline H with respect to
a given date query Q.

$$Score_{QLH}(Q, H) = \sum_{w} P(w \mid \theta_{QLM}) \log P(w \mid \theta_{NHLM}) + const(Q) \tag{9}$$
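The following sketch combines the NHLM estimate of Eq. 8 with the score of Eq. 9. It is illustrative only: the function names and toy data are hypothetical, $|SD|$ is taken here to be the total number of tokens in the supporting posts (an assumption), and the term const(Q) is dropped because it does not affect the ranking for a fixed query.

```python
import math
from collections import Counter


def estimate_nhlm(supporting_posts, collection_model, mu=2000.0):
    """Eq. 8: Dirichlet-smoothed model of the supporting posts SD."""
    counts = Counter()
    for post in supporting_posts:
        counts.update(post)
    sd_len = sum(counts.values())  # |SD| as total token count (assumption)

    def prob(word):
        return (counts.get(word, 0) + mu * collection_model[word]) / (sd_len + mu)

    return prob


def score_qlh(qlm, nhlm_prob):
    """Eq. 9 without const(Q): sum over w of P(w | QLM) * log P(w | NHLM)."""
    return sum(p * math.log(nhlm_prob(w)) for w, p in qlm.items() if p > 0)


# Hypothetical query model and two headlines with different supporting posts.
collection = {"election": 0.001, "vote": 0.001, "oil": 0.001}
qlm = {"election": 0.6, "vote": 0.4}
nhlm_on = estimate_nhlm([["election", "vote", "election"]], collection)
nhlm_off = estimate_nhlm([["oil", "oil", "oil"]], collection)
print(score_qlh(qlm, nhlm_on), score_qlh(qlm, nhlm_off))
```

A headline whose supporting posts share vocabulary with the query model receives the higher score.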


3.2  The  News  Headline  Prior 

We propose two criteria for estimating the prior probability that a news headline will
be a top story on a given day. We treat these criteria as priors of a news headline
because they are independent of the query language model.


Temporal Profiling. The Temporal Profiling criterion uses the time information of
the blog posts relevant to each news headline. We built a temporal profile of each news
headline using the temporal profiling approach proposed by Diaz and Jones [2]. The
temporal profile of the news headline H is defined as follows:


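$$P(t \mid H) = \sum_{d \in R} P(t \mid d) \frac{P(H \mid d)}{\sum_{d' \in R} P(H \mid d')} \tag{10}$$

where $R$ is the document set containing the top 500 blog posts among the search results
from Eq. 6, $P(H \mid d)$ is approximated using $Score(H, d)$, and

$$P(t \mid d) = \begin{cases} 1 & \text{if } t \text{ is equal to the document date} \\ 0 & \text{otherwise} \end{cases}$$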


We then smooth the temporal profile $P(t \mid H)$ using the background model as follows:

$$\tilde{P}(t \mid H) = (1-\alpha) P(t \mid H) + \alpha P(t \mid C) \tag{11}$$

where $\alpha$ is a smoothing parameter and $P(t \mid C) = \frac{1}{|D|} \sum_{d} P(t \mid d)$. $|D|$ is the
total number of documents in the entire collection.

3 We set $\mu = 2000$

This temporal profile is defined for each single day. However, blog posts about an
important news story may appear over a period of several days. Therefore, we smooth
the temporal profile with the profiles of adjacent days. Let $Score_{TP}(Q, H)$ be the
value of a news headline estimated using its temporal profile.

$$Score_{TP}(Q, H) = \sum_{i \in \phi} \tilde{P}(Q + i \mid H) \tag{12}$$

where $\phi$ indicates the period (footnote 4).
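The temporal profiling criterion can be sketched as follows. Several points are assumptions rather than statements of our method: days are represented as integers, each post carries a non-negative weight standing in for $P(H \mid d)$, the value $\alpha = 0.1$ is arbitrary, and the score is taken to be the sum of the smoothed profile over the window from 3 days before to 7 days after the query day.

```python
from collections import defaultdict


def temporal_profile(results):
    """Eq. 10: results is a list of (day, weight) pairs for the posts in R."""
    total = sum(weight for _day, weight in results)
    profile = defaultdict(float)
    for day, weight in results:
        # P(t | d) is 1 only on the post's own day, so each post adds
        # its normalised weight to exactly one day.
        profile[day] += weight / total
    return profile


def smooth(profile, background, alpha):
    """Eq. 11: interpolate with the collection-level distribution P(t | C)."""
    return {
        t: (1 - alpha) * profile.get(t, 0.0) + alpha * p_c
        for t, p_c in background.items()
    }


def score_tp(smoothed, query_day, before=3, after=7):
    """Assumed aggregation: sum the smoothed profile over the window."""
    return sum(smoothed.get(query_day + i, 0.0) for i in range(-before, after + 1))


# Toy data: posts cluster around day 10; the background is uniform over 20 days.
results = [(10, 0.5), (10, 0.3), (12, 0.2)]
background = {day: 1 / 20 for day in range(20)}
profile = temporal_profile(results)
smoothed = smooth(profile, background, alpha=0.1)
print(score_tp(smoothed, 10), score_tp(smoothed, 0))
```

A query day close to the burst of relevant posts receives a much higher score than a day far from it.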


Term Importance. The Term Importance criterion uses the term information of each
news headline. We believe that individual terms differ in importance for a given
date query. If a headline contains more important terms, it is more likely to be a top
story. For example, a headline consisting of common words or stopwords is unlikely to be
a top story.

To estimate the term-based evidence, we first extracted all noun phrases (NPs) from each
news headline using the Stanford Parser (footnote 5). We then gathered all n-grams (n < 3)
from the noun phrases. We evaluate the importance of the n-gram terms using two heuristic
measures.

One is a term frequency measure, which is naive but widely used in many IR tasks.
Intuitively, if a term occurs frequently on a query date, it is more likely to be important.
Let nt be an n-gram term and TF(nt, Q) be the term frequency measure for a date query Q.

$$TF(nt, Q) = \log(1 + c(nt; Q)) \tag{13}$$

where $c(nt; Q)$ is the count of the term nt within the news headlines that were issued
on the same date as the given date query Q.

The other is the term's distribution within the news headline corpus. We believe that
if the occurrences of a term are concentrated on a query date, it is more likely to be
important. Let DI(nt, Q) be the measure of the term distribution for a date query Q. We
define DI(nt, Q) as follows:

$$DI(nt, Q) = \frac{c(nt; Q)}{\sum_{t \in T} c(nt; t)} \tag{14}$$

where $T$ is the set of days in the timespan covered by the news headline
corpus.

Let $Score_{TI}(Q, H)$ be the value of a news headline evaluated based on the
term-based evidence.

$$Score_{TI}(Q, H) = \max_{nt} \big(TF(nt, Q) \times DI(nt, Q)\big) \tag{15}$$
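Eqs. 13 to 15 can be illustrated with the short sketch below. The nested dictionary of per-day counts, the function names, and the toy numbers are assumptions; the extraction of n-gram terms from noun phrases is not shown, and a headline with no terms is given a score of 0 by convention.

```python
import math


def tf(term, query_day, counts):
    """Eq. 13: counts[term][day] is the term's count in that day's headlines."""
    return math.log(1 + counts.get(term, {}).get(query_day, 0))


def di(term, query_day, counts):
    """Eq. 14: share of the term's occurrences that fall on the query day."""
    per_day = counts.get(term, {})
    total = sum(per_day.values())
    return per_day.get(query_day, 0) / total if total else 0.0


def score_ti(headline_terms, query_day, counts):
    """Eq. 15: the largest TF x DI value among the headline's n-gram terms."""
    return max(
        (tf(t, query_day, counts) * di(t, query_day, counts) for t in headline_terms),
        default=0.0,
    )


# Toy counts: "olympics" is concentrated on day 5, "new" is spread evenly.
counts = {"olympics": {5: 9, 6: 1}, "new": {4: 10, 5: 10, 6: 10}}
print(score_ti(["olympics", "new"], 5, counts))
```

Although "new" is slightly more frequent on day 5, the concentrated term "olympics" determines the score because of its higher DI value.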


4 In our runs, the period $\phi$ is between -3 and +7 days from the query day Q

5  http://nlp.stanford.edu/software/lex-parser.shtml 


Run            MAP      P@10
KLEFeed        0.0132   0.0345
KLECluster     0.0182   0.0600
KLEFeedPrior   0.1548   0.2800
KLEClusPrior   0.1605   0.2964

Table 1. Performance based on the headline relevance judgments only


Run            NDCG-alpha@10   intent-aware P@10
KLEFeed        0.066           0.023
KLECluster     0.098           0.037
KLEFeedPrior   0.504           0.162
KLEClusPrior   0.409           0.117

Table 2. Performance based on the blog-post level evaluation


3.3 The Ranking Function

From the above steps, we obtain three scores related to the importance of a news
headline. To rank the news headlines, we combine the query likelihood score
with the two measures for the headline prior. However, each score has a different scale.
Therefore, each score is rescaled to the range from 0 to 1. Finally, our ranking function
is as follows:

$$Score(H, Q) = (1 - p_1)\, Score_{QLH}(Q, H) + p_1 \{(1 - p_2)\, Score_{TP}(Q, H) + p_2\, Score_{TI}(Q, H)\} \tag{16}$$

where $p_1$ is the weighting parameter that controls the balance between the query
likelihood and the headline prior components, and $p_2$ controls the weight between the two
pieces of evidence for the headline prior.
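The combination in Eq. 16 can be sketched as follows. Min-max scaling is an assumption, since the text states only that each score is scaled from 0 to 1; the function names and the toy scores are hypothetical, and the default weights are those reported for the runs that use the prior.

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1] (min-max scaling, an assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {h: 0.0 for h in scores}
    return {h: (s - lo) / (hi - lo) for h, s in scores.items()}


def rank_headlines(qlh, tp, ti, p1=0.7, p2=0.2):
    """Eq. 16 applied to three dicts keyed by headline id."""
    qlh, tp, ti = min_max(qlh), min_max(tp), min_max(ti)
    final = {
        h: (1 - p1) * qlh[h] + p1 * ((1 - p2) * tp[h] + p2 * ti[h]) for h in qlh
    }
    return sorted(final, key=final.get, reverse=True)


# Toy scores for three headlines.
qlh = {"h1": -5.0, "h2": -4.0, "h3": -6.0}
tp = {"h1": 0.9, "h2": 0.1, "h3": 0.5}
ti = {"h1": 2.0, "h2": 0.5, "h3": 1.0}
print(rank_headlines(qlh, tp, ti))
print(rank_headlines(qlh, tp, ti, p1=0.0))
```

Setting the first weight to 0 ranks by the query likelihood alone, while the default weights let the prior move "h1" to the top in this toy case.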


4  Runs 

In  the  Top  Stories  Identification  Task,  we  submitted  4  runs  as  follows: 

1. KLEFeed chooses 10 supporting blog posts using the FBS to estimate the NHLM,
and does not use the news headline prior, that is, $p_1 = 0$.

2. KLECluster chooses 10 supporting blog posts using the CBS to estimate the NHLM,
and does not use the news headline prior.

3. KLEFeedPrior chooses 10 supporting blog posts using the FBS to learn the NHLM,
and uses the news headline prior with $p_1 = 0.7$, $p_2 = 0.2$.

4. KLEClusPrior chooses 10 supporting blog posts using the CBS to learn the NHLM,
and uses the news headline prior with $p_1 = 0.7$, $p_2 = 0.2$.

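Tables 1 and 2 show the performance of our runs based on the headline relevance
judgments only and on the blog-post level evaluation, respectively. Adding the news
headline prior raises MAP from 0.0132 to 0.1548 with FBS and from 0.0182 to 0.1605
with CBS.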


5  Conclusions 


We have described our participation in the TREC 2009 Blog Track. For the Top Stories
Identification Task, we presented two components, the query likelihood and the news
headline prior, both based on the language model framework. To evaluate the query
likelihood, we estimated the query language model and the news headline language model
using the contents of blog posts. We also proposed two methods for choosing the 10
supporting blog posts for each news headline: FBS and CBS. Furthermore, we suggested
two criteria for estimating the news headline prior. One is the Temporal Profiling
criterion, which uses the time information of blog posts relevant to each headline. The
other is the Term Importance criterion, which uses the term information of a news headline.
Experimental results show that using the news headline prior significantly improves the
performance.